CONVERSION TABLE
Conversion factors for U.S. Customary
to metric (SI) units of measurement.

Convert from                                    | To                         | Multiply by
------------------------------------------------+----------------------------+--------------------------
angstrom                                        | meters (m)                 | 1.000 000 X E -10
atmosphere (normal)                             | kilo pascal (kPa)          | 1.013 25 X E +2
bar                                             | kilo pascal (kPa)          | 1.000 000 X E +2
barn                                            | meter^2 (m^2)              | 1.000 000 X E -28
British thermal unit (thermochemical)           | joule (J)                  | 1.054 350 X E +3
calorie (thermochemical)                        | joule (J)                  | 4.184 000
cal (thermochemical)/cm^2                       | mega joule/m^2 (MJ/m^2)    | 4.184 000 X E -2
curie                                           | *giga becquerel (GBq)      | 3.700 000 X E +1
degree (angle)                                  | radian (rad)               | 1.745 329 X E -2
degree Fahrenheit                               | degree kelvin (K)          | t_K = (t_F + 459.67)/1.8
electron volt                                   | joule (J)                  | 1.602 19 X E -19
erg                                             | joule (J)                  | 1.000 000 X E -7
erg/second                                      | watt (W)                   | 1.000 000 X E -7
foot                                            | meter (m)                  | 3.048 000 X E -1
foot-pound-force                                |                            | 1.355 818
gallon (U.S. liquid)                            | meter^3 (m^3)              |
inch                                            | meter (m)                  | 2.540 000 X E -2
jerk                                            | joule (J)                  |
joule/kilogram (J/kg) (radiation dose absorbed) | Gray (Gy)                  | 1.000 000
kilotons                                        |                            |
                                                | newton (N)                 |
                                                |                            | 6.894 757 X E +3
ktap                                            |                            | 1.000 000 X E +2
micron                                          | meter (m)                  | 1.000 000 X E -6
mil                                             | meter (m)                  | 2.540 000 X E -5
mile (international)                            | meter (m)                  | 1.609 344 X E +3
ounce                                           | kilogram (kg)              | 2.834 952 X E -2
pound-force (lbs avoirdupois)                   | newton (N)                 | 4.448 222
pound-force inch                                | newton-meter (N.m)         | 1.129 848 X E -1
pound-force/inch                                | newton/meter (N/m)         | 1.751 268 X E +2
pound-force/inch^2 (psi)                        | kilo pascal (kPa)          | 6.894 757
                                                |                            | 4.214 011 X E -2
                                                | kilogram/meter^3 (kg/m^3)  | 1.601 846 X E +1
rad (radiation dose absorbed)                   | **Gray (Gy)                | 1.000 000 X E -2
roentgen                                        | coulomb/kilogram (C/kg)    | 2.579 760 X E -4
shake                                           | second (s)                 | 1.000 000 X E -8
slug                                            | kilogram (kg)              | 1.459 390 X E +1
torr (mm Hg, 0 C)                               | kilo pascal (kPa)          | 1.333 22 X E -1

*The becquerel (Bq) is the SI unit of radioactivity; 1 Bq = 1 event/s.
**The Gray (Gy) is the SI unit of absorbed radiation.
SECTION 1
INTRODUCTION


1-1 HIGHLIGHTS OF PRIOR EFFORTS.

This  report  summarizes  the  continuation  of  previous 
studies  of  digital  voice  radio  communications  in  the  presence  of 
fading  due  to  high  altitude  nuclear  device  detonation  [1].  Most 
of  the  effort  in  this  study  was  invested  in  developing  and 
testing a Fourier transform coding technique designated
"Selective Frequency Coding" (SFC).

Previous efforts demonstrated that the standard 2400
baud LPC-10 algorithm degrades word intelligibility rapidly when
bit error rates exceed 1-2%, and that word intelligibility
suffers more in a fading environment than in a random noise
environment. Furthermore, these previous efforts indicated that
minor adjustments to the LPC-10 algorithm will not significantly
mitigate this loss of intelligibility in a fading environment.

The  current  work  continues  the  search  for  a  voice 
communications  system  which  is  robust  to  bit  errors  in  fading. 
Since  the  transmission  channel  and  its  protocol  are  fixed, 
modifying  the  channel  by  adding  time,  frequency  or  spatial 
diversity  was  deemed  beyond  the  scope  of  this  study.  Instead, 
extensions  to  the  previous  work  focused  on  examining 
alternatives  to  the  LPC-10  algorithm  which  might  prove  more 
robust  against  typical  errors  in  fading  environments. 

1-2  BASIS  FOR  THE  FFT  VOCODER. 


Transform domain coders have been extensively reported
in the literature as methods of achieving toll quality voice
compression at medium bit rates (8 kbps - 16 kbps). These
coders fall into two broad categories: those which use the
slowly changing envelope of the voice spectrum to efficiently
allocate bits to various spectral regions [2], and those which
explicitly model voice as a sum of narrowband sinusoids [3].

MAXIM has chosen to investigate the latter,
sum-of-sinusoids model as a potential approach to a 2400 bps
scheme that produces synthetic quality voice and remains more
intelligible than LPC in fading environments.

This  approach  is  intuitively  appealing,  since  the  various 
sinusoid  components  when  coded  independently  should  carry 
comparable  amounts  of  information.  Consequently,  an  error  in 
the  bits  representing  one  of  these  components  would  not 
completely  alter  the  character  of  the  voice  sound  as  in  LPC,  but 
would  instead  result  in  some  sort  of  background  noise.  One  might 
expect  that  the  human  ear  would  be  able  to  sort  through  the 
background  noise,  maintaining  word  intelligibility  despite  the 
bit errors. Experiments have verified that this conjecture
holds.


The technique is conceptually simple. A 256 point FFT
is performed on the input voice, and the 6 largest peaks in the
spectral amplitude are found. The location, amplitude, and
phase of each of these peaks are transmitted to the receiver,
where a reconstruction is performed in a way that ensures
waveform continuity across frame boundaries. The result is
speech that sounds somewhat reverberant, but is highly
intelligible, retains a high degree of talker recognizability,
requires no error-prone pitch estimation, and can be implemented
with low cost, current generation digital signal processing
devices.


1-3  REPORT  OVERVIEW. 


The body of this report presents a brief summary of
the characteristics of voice signals in general to provide a
foundation for describing the technique, then discusses the design
trade-offs and presents the intelligibility test results. We
conclude with some recommendations for future study in the
area of achieving survivable low bit-rate voice communications
systems.

1-4  SUMMARY  OF  RESULTS. 

The  work  performed  under  this  contract  has  resulted  in 
a  2400  baud  voice  coding  technique  that  permits  intelligible 
reception  in  the  presence  of  5-10%  bit  error  rates.  LPC-10 
typically  becomes  very  difficult  to  understand  at  5-6%  error 
rate  in  a  comparable  environment. 
SECTION 2

SPECTRAL  CHARACTERISTICS  OF  SPEECH 


2-1  MODEL  OF  SPEECH  GENERATION. 

Speech sounds can be classified into two broad groups:
voiced sounds, such as vowels (a, e, i, o, u, y) and sonorant
consonants (n, l); and unvoiced sounds, such as t, ch, and p. The
voiced  sounds  are  created  by  periodic  pops  of  air  released  by 
the  glottis  in  the  throat.  These  impulses  of  air  are  filtered 
by  the  resonances  of  the  throat  and  mouth  to  form  a  ringing 
waveform  that  repeats  with  a  basic  pitch  period  as  illustrated 
in  Figure  2-1.  Unvoiced  sounds  are  created  by  suddenly 
releasing  pressure  developed  behind  the  lips  or  tongue  as  in 
"p",  "k",  or  "t",  or  by  forcing  air  through  a  constriction  as  in 
the  sounds  "ch"  and  "s".  Unvoiced  sounds  are  not  periodic,  but 
instead  resemble  random  noise. 

Both  voiced  and  unvoiced  sounds  can  be  modeled  by 
factoring  the  sound  into  2  components:  an  excitation  function 
and  a  filter  which  modifies  the  excitation  function.  For  voiced 
sounds,  the  excitation  function  is  a  periodic  pulse  train 
representing  the  glottal  pulses,  and  the  filter  represents  the 
spectral  shaping  imposed  by  the  resonances  of  the  throat  and 
mouth.  For  unvoiced  sounds,  the  excitation  function  is  a  white 
noise  source  and  the  filter  shapes  the  noise  spectrum 
appropriately  for  the  type  of  noise  sound. 

2-2  VOCODER  CONCEPT. 


Most  low  bit  rate  vocoders  achieve  their  savings  in  bit 
rate  from  the  very  simple  models  of  the  excitation  functions 
described  above:  for  voiced  sounds,  only  a  pitch  period  and 
amplitude  need  to  be  transmitted,  while  unvoiced  sounds  only 
require  an  excitation  amplitude  since  the  excitation  function  is 
white  noise. 
Figure 2-1. LPC vocoder structure.

This  factoring  of  voiced  sounds  into  two  components  can  also  be 
interpreted  in  the  spectral  domain.  The  vocal  tract  filter 
imposes  a  smooth  overall  envelope  to  the  spectrum,  while  the 
periodicity  of  the  waveform  causes  the  fine  structure  within 
this  envelope  consisting  of  a  series  of  spectral  lines 
replicated at harmonics of the pitch frequency.


[Plot of |S(f)| versus frequency, with the spectral envelope and
the fine structure due to waveform periodicity labeled.]

Figure 2-2. Short term voice spectrum.

2-3  SFC  SPEECH  MODEL. 

In  contrast  to  the  pitch-excited  vocoder  voice  model, 
the  selective  frequency  coding  technique  does  not  factor  voice 
into  an  excitation  function  followed  by  a  slowly  changing 
filter.  Furthermore,  SFC  does  not  require  an  explicit  pitch 
estimate  to  account  for  the  periodicity  of  voiced  sounds. 
Instead, SFC relies on the fact that periodic signals give rise
to  a  Fourier  spectrum  containing  distinct  spectral  lines. 

Rather  than  forcing  these  spectral  lines  to  be  exact  harmonic 
multiples  with  the  specific  phase  relationship  corresponding  to 
a  periodic  waveform,  SFC  simply  chooses  the  largest  peaks  and 
transmits  the  locations,  amplitudes,  and  phases  of  these  peaks. 
SECTION 3

DEVELOPMENT  BACKGROUND 


This  section  will  begin  with  a  description  of  the 
hardware  used  to  digitize  and  reconstruct  the  voice.  This  will 
be  followed  by  brief  descriptions  of  the  various  algorithms 
tested  in  the  course  of  developing  the  final  SFC  algorithm. 

3-1  DIGITIZING  EQUIPMENT. 

The  equipment  used  in  this  study  to  digitize  and 
reconstruct  voice  consisted  of  a  tape  deck,  a  stereo  equalizer, 
a  simple  audio  filter,  a  DEC  LPA-11A  A/D-D/A  system,  and  a  VAX 
11/750  computer. 

Initially,  digitizing  was  performed  at  20  kHz, 
permitting  a  10  kHz  signal  bandwidth.  This  sample  rate  was 
later  reduced  to  the  telephone  industry  standard  8  kHz  since 
peaks  are  rarely  chosen  in  the  4-10  kHz  region,  and  the  8  kHz 
rate  yields  an  appropriate  time  span  when  a  256  sample  frame 
size  is  used. 

A  National  Semiconductor  AF-134  filter  device  was  used 
in  both  digitizing  and  reconstruction  operations.  The  equalizer 
was  used  to  provide  slight  pre-emphasis  which  was  found  to 
enhance  the  intelligibility  of  some  male  talkers  significantly. 
The  equalizer  was  also  used  to  provide  appropriate  gain.  Figure 
3-2  shows  the  frequency  response  of  the  filter/equalizer 
combination  when  the  equalizer  is  set  for  flat  response. 

The  LPA-11A  digitizer  has  an  input  range  of  +/-  5V 
which  converts  to  a  range  of  0-4095.  The  output  samples  are 
offset  binary  with  2048  representing  0  volts.  After 
characterizing  the  harmonic  distortion  of  the  digitizer  as  a 
function  of  input  amplitude,  it  was  determined  that  the 
converter  should  be  driven  at  roughly  70%  of  its  maximum  dynamic 
range  to  minimize  intermodulation  distortion. 
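
The offset binary format described above can be illustrated with a short sketch. The report gives only the input range (+/- 5 V), the code range (0-4095), and the zero code (2048); the linear scale of 2048 codes per 5 V and the clamping at full scale are assumptions made here for illustration, and the function names are hypothetical.

```python
# Sketch of 12-bit offset binary conversion, assuming a linear scale
# of 2048 codes per 5 V (an assumption; the report states only the
# range 0-4095 and that code 2048 represents 0 volts).

FULL_SCALE_VOLTS = 5.0   # input range is +/- 5 V
ZERO_CODE = 2048         # offset binary code for 0 V
MAX_CODE = 4095          # largest 12-bit code


def code_to_volts(code: int) -> float:
    """Convert an offset binary sample to volts."""
    if not 0 <= code <= MAX_CODE:
        raise ValueError("code outside 12-bit range")
    return (code - ZERO_CODE) * FULL_SCALE_VOLTS / ZERO_CODE


def volts_to_code(volts: float) -> int:
    """Convert volts to the nearest offset binary code, clamped to range."""
    code = round(volts * ZERO_CODE / FULL_SCALE_VOLTS) + ZERO_CODE
    return max(0, min(MAX_CODE, code))
```

Under these assumptions, the signed sample value used by later processing is simply the code minus 2048.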
Figure 3-1. Digitizing equipment.

3-2 INITIAL EXPERIMENTS.


3-2.1  Algorithm  Using  Largest  Spectral  Bins. 

Initial  efforts  for  the  FFT  vocoder  attempted  to  retain 
the  most  spectral  power  by  coding  the  largest  spectral  bins  of 
the  FFT.  Since  the  chosen  frequency  elements  had  the  largest 
magnitudes (see Figure 3-3), they might be expected to retain
most of the waveform characteristics.

Figure 3-3. Largest spectral bin analysis algorithm.

The  sampled  input  speech  was  analyzed  through  a  fast 
Fourier transform (FFT), and the magnitude and phase of the bins
containing  the  largest  magnitudes  were  transmitted.  The 
receiver  then  reconstructed  the  waveform  using  an  inverse  FFT. 

The  speech  quality  using  this  technique  was  poor  at  low 
data  rates.  The  frequency  bin  selection  was  concentrated 
largely  in  the  low  frequencies  below  1  kHz,  producing  speech 
which  sounded  muffled  due  to  the  lack  of  high  frequency 
content.  At  data  rates  above  8  kbps,  very  natural,  highly 
intelligible  speech  was  produced  due  to  the  retention  of  more 
frequency  components  including  those  in  the  upper  passband. 

In  addition  to  the  lack  of  high  frequency  information 
at  low  data  rates,  the  reconstructions  suffered  from 
discontinuities  at  the  frame  boundaries.  These  discontinuities 
were  perceived  as  clicks  occurring  at  about  30  Hz. 
Figure 3-4. Largest spectral bin reconstruction.

3-2.2 Algorithm Using Intraframe Spectral Differences.

The  previous  results  confirmed  the  intuitive  notion 
that  preserving  more  frequency  bins  improves  the  reconstructed 
speech  quality,  so  some  effort  was  aimed  at  retaining  more 
information  by  relying  on  the  fact  that  the  spectrum  often 
changes slowly from frame to frame. By retaining a spectrum
that models the past speech frames and transmitting the bins of
the current frame that create the largest differential, we can
preserve more information, transmit less redundant information,
and produce higher quality reconstructions. This approach is
depicted in Figure 3-5.

Figure 3-5. Intraframe spectral difference algorithm.

The  speech  is  processed  as  before  through  an  FFT.  The 
resultant  spectrum  is  compared  to  a  running  spectrum,  and  the 
magnitudes  and  phases  of  the  bins  creating  the  largest 
difference  are  coded  and  transmitted.  The  running  spectrum  is 
calculated  from  frequency  information  of  previous  frames,  and  is 
decayed  with  time. 
The speech processed in this manner suffered from an
unacceptable reverberance. Several attempts to reduce this
reverberance met with some success; however, the resulting
quality was ultimately not acceptable at rates below 8 kbps.


3-3  ALGORITHM  USING  MAXIMUM  SPECTRAL  PEAKS. 


3-3.1 Algorithm Concept.

One of the problems with the speech quality from the
previous two attempts was that the spectrum was mainly being
selected in the lower end of the speech band, and the
reconstructed speech consequently sounded muffled. An alternate
method was attempted that considered only those FFT bins that
corresponded to local maxima in the spectrum, as shown in Figure
3-6. To the extent that the short term spectrum is represented
by a small number of sine waves, these sines would not generally
fall directly on an FFT bin frequency. Choosing the peak
spectral bins of an FFT therefore corresponds to slightly
shifting their frequencies, so that each is represented as a
sine at an exact bin frequency. Since the periodic nature of
the voice gives rise to a line spectrum, we might expect this
model to work adequately for voiced frames. With unvoiced
frames, the hope was that the non-harmonic relationship of the
bins chosen, together with the unstructured phases of these bins,
would adequately represent noise-like sounds.

In summary, the approach was to transform blocks of the
input data via an FFT and transmit the amplitudes and phases of
the bins with the largest local maxima in the spectrum. This
frequency selection involved more of the speech band, which
resulted in higher quality speech than the previous two
approaches: the voice was more pleasant, with the muffled
quality largely eliminated. While the speech sounded somewhat
synthetic, high intelligibility and some degree of talker
recognition were maintained.

Figure 3-6. Maximum spectral peak algorithm.
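
The peak selection step can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it uses a direct DFT in place of an FFT to stay dependency-free, defines a local maximum as a bin larger than its left neighbor and no smaller than its right neighbor, and skips the first and last bins. Those choices, and the function names, are assumptions made for this example.

```python
import cmath
import math


def half_spectrum(frame):
    """Direct DFT of a real frame, returning only the first N/2 bins."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2)
    ]


def largest_spectral_peaks(frame, num_peaks=6):
    """Return (bin, magnitude, phase) for the largest local maxima."""
    spectrum = half_spectrum(frame)
    mags = [abs(c) for c in spectrum]
    # Candidate peaks: bins that are local maxima of the magnitude spectrum.
    candidates = [
        k for k in range(1, len(mags) - 1)
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    ]
    # Keep the num_peaks largest candidates, reported in bin order.
    candidates.sort(key=lambda k: mags[k], reverse=True)
    chosen = sorted(candidates[:num_peaks])
    return [(k, mags[k], cmath.phase(spectrum[k])) for k in chosen]
```

For a 256 sample frame this returns bin numbers in the range 1-126, each with the magnitude and phase that would then be quantized for transmission.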


3-3.2  Quantization  Scheme. 


At this point more attention was turned to the actual
bit coding, which involved choosing a compression scheme that
would permit coding the maximum number of peaks and settling on
a final frame size.


A 32 ms frame size was found to be best for an 8 kHz
sampling rate. If the frame size were shorter, fewer peaks
could be coded per frame within the data rate limitation. Longer
frames would smear the rapid changes that occur in speech. The
32 ms window also gates out 256 samples, a size convenient for
FFT processing.
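
The frame figures quoted above follow directly from the sample rate and frame length, as this short calculation shows. The function name and dictionary keys are hypothetical; the inputs (8 kHz, 256 samples) are from the text.

```python
def frame_parameters(sample_rate_hz=8000, frame_samples=256):
    """Derive frame timing and bin layout from sample rate and frame size."""
    unique_bins = frame_samples // 2  # a real input needs only half the bins
    return {
        "frame_ms": 1000 * frame_samples / sample_rate_hz,
        "frames_per_second": sample_rate_hz / frame_samples,
        "bin_spacing_hz": sample_rate_hz / frame_samples,
        "unique_bins": unique_bins,
        "bits_for_bin_index": (unique_bins - 1).bit_length(),
    }


params = frame_parameters()
```

With these inputs the frame lasts 32 ms, frames arrive at 31.25 per second (consistent with frame boundary clicks heard at about 30 Hz), and the 128 unique bins are spaced 31.25 Hz apart and need 7 bits to index.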

With the 32 ms window chosen, 7 bits are needed for the
frequency bin coding, leaving 3 bits for the magnitude and 3
bits for the phase. Because the input samples are real, the two
halves of the 256 point FFT output are mirror images of each
other (complex conjugates). Thus only 128 frequency bins need to
be searched and coded, so 7 bits suffice to uniquely specify the
frequency bin number. A logarithmic u-law compression scheme was
chosen to code the magnitude. The magnitude was compressed by:

$$ M = \frac{\log(1 + u\,m)}{\log(1 + u)} $$

where

M = compressed magnitude
m = magnitude
u = u-law factor

The u factor was chosen experimentally, and a value of
100 appeared to work best. Three bits were sufficient and
resulted in only a slight amount of distortion due to the
quantization.
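
A minimal sketch of the magnitude coding follows, using the compression formula above with u = 100 and 3 bits. The report does not state how the magnitude is normalized or how the compressed value is mapped to codes; the sketch assumes m is scaled to the range 0 to 1 and that M is rounded uniformly to 8 levels. Function names are hypothetical.

```python
import math

U_FACTOR = 100.0  # value reported to work best


def compress(m, u=U_FACTOR):
    """Logarithmic compression of a magnitude m assumed to lie in [0, 1]."""
    return math.log(1 + u * m) / math.log(1 + u)


def expand(big_m, u=U_FACTOR):
    """Inverse of compress()."""
    return ((1 + u) ** big_m - 1) / u


def quantize(m, bits=3):
    """Map a magnitude to an integer code (uniform rounding is assumed)."""
    levels = (1 << bits) - 1
    return round(compress(m) * levels)


def dequantize(code, bits=3):
    """Recover an approximate magnitude from its integer code."""
    levels = (1 << bits) - 1
    return expand(code / levels)
```

The logarithmic curve spends more of the 8 available codes on small magnitudes, which is where weaker spectral peaks fall.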

For  coding  the  phase  a  3  bit  linear  compression  was 
used,  making  the  resolution  pi/4  radians  with  a  maximum  error 
pi/8.  With  these  bit  allocations  we  could  choose  6  frequency 
peaks  if  we  coded  the  five  largest  elements  as  described,  and 
the  sixth  largest  with  only  2  bits  for  magnitude,  and  2  bits  for 
phase.  This  was  possible  since  the  sixth  element  is  normally 
lower  in  energy  and  does  not  require  the  resolution  of  the 
previous  five  elements.  This  leaves  the  data  rate  at  2375  bps 
with  25  bps  free  for  synchronization. 
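
The bit allocation above can be checked with a short calculation. All figures are taken from the text; only the variable and function names are introduced here.

```python
import math

SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 256
CHANNEL_BPS = 2400


def bits_per_frame():
    """Five peaks at 7+3+3 bits and a sixth at 7+2+2 bits."""
    full_peaks = 5 * (7 + 3 + 3)
    reduced_peak = 1 * (7 + 2 + 2)
    return full_peaks + reduced_peak


frame_bits = bits_per_frame()
data_rate_bps = frame_bits * SAMPLE_RATE_HZ / FRAME_SAMPLES
spare_bps = CHANNEL_BPS - data_rate_bps

# A 3 bit uniform phase code divides the circle into 8 steps.
phase_step = 2 * math.pi / 2 ** 3
max_phase_error = phase_step / 2
```

This gives 76 bits per 32 ms frame, or 2375 bps, leaving 25 bps of the 2400 bps channel; the phase step is pi/4 with a worst case error of pi/8.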

3-4  INITIAL  INTELLIGIBILITY  TESTS. 

In the development of a new voice coding scheme the
quality must be measured to evaluate its performance. In a
benign case there will be no errors and the speech will be at
its best performance. To measure this performance the
Diagnostic Rhyme Test (DRT) is used. This test measures the
listener's ability to distinguish between two similar sounding
words. In an errored case, particularly in the fading
environment, the errors corrupt large portions of entire words.
Since DRT scores insufficiently measure intelligibility in these
cases, the Phonetic Alphabet Comprehension Test (PACT) was
developed [1].

The PACT measures the listener's ability to comprehend
a fixed set of words in an errored environment. Consequently,
PACT scores together with the DRT scores give a good indication
of the overall quality of the reproduced voice.

3-4.1 Diagnostic Rhyme Test (DRT).

The DRT consists of a set of words that demonstrate the
phonemic features of speech (see Table 3-1): voicing, nasality,
sustention, sibilation, graveness, and compactness. Words are
used in pairs, with one word exhibiting the phonemic feature and
the other not. One word of each pair is chosen and processed.
If the listener can detect the difference by choosing the
correct word during evaluation, then the algorithm succeeds in
retaining this particular phonemic feature. If the algorithm
does well in all areas, it can be judged completely
intelligible. If not, it must be determined whether it is
acceptable for the specific application.

We chose six speakers, three male and three female, of
varying backgrounds. Each speaker randomly chose and read one
word from each pair in each group, recording the choices and the
order in which they were read. The speech samples were then
applied to the algorithms under test at varying data rates. For
each data rate the reconstructed speech was evaluated by 6
different listeners, who were also of varying backgrounds. For
each word pair, the listeners indicated which of the two words
they thought was spoken. These choices were compared to the
words actually spoken to determine whether each choice was
correct and which types of phonemic sounds gave the algorithm
difficulty. The comprehension results were then compiled,
corrected for guessing, and conclusions derived.
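
The report does not state the correction formula it applied. For a two-choice test such as the DRT, the customary correction subtracts wrong answers from right answers, so that pure guessing scores zero; the sketch below assumes that form. The function name is hypothetical.

```python
def drt_scores(num_right, num_wrong):
    """Return (unadjusted, adjusted) percent scores for a two-choice test.

    The adjusted score assumes the customary two-alternative guessing
    correction, 100 * (right - wrong) / total; this is an assumption,
    since the report does not give its formula.
    """
    total = num_right + num_wrong
    if total == 0:
        raise ValueError("no responses")
    unadjusted = 100.0 * num_right / total
    adjusted = 100.0 * (num_right - num_wrong) / total
    return unadjusted, adjusted
```

Under this assumption a listener who answers 90 of 100 items correctly scores 90 unadjusted and 80 adjusted, while chance performance (50 of 100) adjusts to 0.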

Table 3-1. DRT word pairs.

Stimulus Words used in the DRT

VOICING              NASALITY             SUSTENTION
Voiced - Unvoiced    Nasal - Oral         Sustained - Interrupted

veal - feel          meat - beat          vee - bee
bean - peen          need - deed          sheet - cheat
gin - chin           mitt - bit           vill - bill
dint - tint          nip - dip            thick - tick
zoo - Sue            moot - boot          foo - pooh
dune - tune          news - dues          shoes - choose
vole - foal          moan - bone          those - doze
goat - coat          note - dote          though - dough
zed - said           mend - bend          then - den
dense - tense        neck - deck          fence - pence
vast - fast          mad - bad            than - Dan
gaff - calf          nab - dab            shad - chad
vault - fault        moss - boss          thong - tong
daunt - taunt        gnaw - daw           shaw - chaw
jock - chock         mom - bomb           von - bon
bond - pond          knock - dock         vox - box

SIBILATION               GRAVENESS            COMPACTNESS
Sibilated - Unsibilated  Grave - Acute        Compact - Diffuse

zee - thee               weed - reed          yield - wield
cheep - keep             peak - teak          key - tea
jilt - gilt              bid - did            hit - fit
sing - thing             fin - thin           gill - dill
juice - goose            moon - noon          coop - poop
chew - coo               pool - tool          you - rue
Joe - go                 bowl - dole          ghost - boast
sole - thole             fore - thor          show - so
jest - guest             met - net            keg - peg
chair - care             pent - tent          yen - wren
jab - gab                bank - dank          gat - bat
sank - thank             fad - thad           shag - sag
jaws - gauze             fought - thought     yawl - wall
saw - thaw               bong - dong          caught - taught
jot - got                wad - rod            hop - fop
chop - cop               pot - tot            got - dot

3-4.2 Phonetic Alphabet Comprehension Test (PACT).

For the nuclear stressed environment the Phonetic
Alphabet Comprehension Test (PACT) was used. This test
evaluates how the performance measured by the DRT is affected by
an environment such as a fading nuclear environment. The PACT
concentrates on the overall comprehension of a message rather
than on the individual sounds as with the DRT. This is
applicable in the fading nuclear environment since the errors
tend to be clustered. The result is the loss of complete words
or large parts of them, so a test to measure the comprehension
of a sequence of words is needed. The actual PACT considers a
sequence of military phonetic words (alpha, bravo, charlie, ...
yankee, zulu) (Table 3-2) that have been passed through a
fading communications channel. The actual sequence is chosen
randomly from the 26 phonetic words. The performance is
measured by the listener's ability to detect and correctly
comprehend each word in the order presented. The results of the
PACT are used in combination with the DRT scores. Any PACT
score less than 100 indicates a loss in intelligibility from the
benign case measured by the DRT. A PACT score of 100 indicates
performance comparable to the DRT score; that is, intelligibility
equal to that of the speech in the benign case.

Table 3-2. Military phonetic alphabet.

PHONETIC ALPHABET LIST

A - Alpha        N - November
B - Bravo        O - Oscar
C - Charlie      P - Papa
D - Delta        Q - Quebec
E - Echo         R - Romeo
F - Foxtrot      S - Sierra
G - Golf         T - Tango
H - Hotel        U - Uniform
I - India        V - Victor
J - Juliet       W - Whiskey
K - Kilo         X - X-ray
L - Lima         Y - Yankee
M - Mike         Z - Zulu

3-4.3 Bit Error Application.

Evaluating errors due to a noisy or a fading
environment requires modeling the satellite link and the
propagation channel. The link or system model includes the
hardware used to protect against errors by coding, modulating,
and interleaving the data. The channel model used in our
studies simulates the fading due to multipath and phase
distortion of high altitude nuclear effects (HANE). Coded
speech data is processed through these models to measure the
effects of the bit errors.

3-4.3.1 Channel Model.

The  channel  model  simulates  the  effects  of  the 
environment  on  the  data  transmission,  including  Gaussian 
background  noise,  signal  path  losses,  and  fading  due  to 
multipath.  Two  groups  of  channel  model  parameters  were  employed 
in  this  study.  In  one  set,  a  severe  multipath  model  was 
applied,  and  in  the  other  set,  only  white  Gaussian  noise  was 
applied.  The  parameters  were  adjusted  in  the  two  sets  to 
provide  two  cases  of  aggregate  bit  error  rates:  1%  and  10%.  In 
a  multipath  environment,  occasional  deep  fades  result  in  a  very 
high  percentage  of  bit  errors  during  the  fade,  so  errors  will 
occur  in  bursts.  This  fading  case  is  contrasted  to  the  case  of 
a model with no multipath, in which the errors are distributed
randomly.


The  Rayleigh  fading  model  employed  allows  the  time 
scale  of  fading  to  be  modified  by  setting  the  fade  decorrelation 
time.  While  fade  depth,  duration,  and  separation  are  randomly 
generated  by  this  model,  a  short  decorrelation  time  will 
typically  result  in  fades  of  short  duration  separated  by  short 
time  periods.  Conversely,  a  long  decorrelation  time  will  cause 
the  multipath  channel  model  to  evolve  more  slowly,  providing 
longer  fade  durations  with  longer  periods  between  the  fades  on 
the  average. 

Since average fade duration is proportional to
decorrelation  time,  and  since  the  bit  error  rate  during  a  fade 
is  extremely  high,  using  a  long  decorrelation  time  (say,  1  sec) 
will  result  in  the  loss  of  entire  syllables  or  words,  seriously 
impacting  word  intelligibility  of  the  voice.  With  short  fade 
durations,  some  frames  of  most  words  will  be  corrupted  with 
errors,  but  enough  of  the  frames  are  properly  received  to  allow 
reasonable  intelligibility. 

3-4.3.2 System Model.

To  evaluate  the  performance  of  a  voice  coding  scheme  in 
a  satellite  system,  a  specific  transmitter  and  receiver  must  be 
simulated.  Consequently,  we  must  define  coding  schemes, 
modulation  schemes,  and  other  processes  that  may  be  used  in  a 
system  for  error  correction.  We  used  a  model  of  an  existing 
satellite  system  which  is  known  to  perform  well  in  a  white 
Gaussian  noise  environment.  This  link  model  includes 
convolutional  coding,  soft  Viterbi  decoding,  differential  phase 
shift  keying  (DPSK) ,  and  convolutional  interleaving.  The 
encode/decode  process  used  is  a  rate  1/2  (Rl/2)  convolutional 
code  along  with  a  soft  decision  Viterbi  decoder.  The  encoder 
transforms  the  incoming  data  bit  with  2  symbol  bits,  so  a  2400 
bps  data  stream  will  be  encoded  into  a  4800  bps  symbol  bit 
stream  [6].  In  the  receiver  the  encoded  symbols  are  decoded 
through  a  soft  decision  Viterbi  decoder.  The  soft  decision 
retains  the  quantized  values  of  the  received  bit  stream  that 
indicate  how  the  bit  information  varies  after  going  through  the 
error  channel.  The  path  metric  is  determined  from  these  values 
and  the  symbols  are  decoded  using  Viterbi' s  maximum  likelihood 
algorithm.  The  modulation/  demodulation  scheme  is  differential 
phase  shift  keying  (DPSK) .  Input  to  the  modulator  is  the 
interleaved  coded  symbol  bit  stream,  and  the  output  is  a 
waveform  that  shifts  180  degrees  in  phase  whenever  an  input  bit 
of  logic  1  is  applied.  The  demodulator  will  detect  this  phase 
shift  in  the  waveform,  and  determine  a  quantized  value  in 
accordance  with  this  detected  information.  This  quantized  value 
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is  a  level  which  ultimately  demonstrates  the  confidence  in  which 
the  symbol  decision  can  be  made  by  the  Viterbi  decoder.  For  the 
fading  environment  an  interleaver  will  have  a  large  contribution 
in  the  performance  of  the  system  model.  The  coding  described 
above  will  perform  very  well  in  a  noise  environment  where  the 
errors  are  randomly  distributed.  In  fading,  however,  the  errors 
are  grouped  in  bursts  and  create  bit  error  sequences  that  the 
decoder cannot correct. An interleaver takes its input bits
and creates a new bit sequence through a convolution, so that no
contiguous sequence of n2 bits of the new sequence
contains any pair of symbols that were originally separated by
fewer than n1 bits [7]. This new sequence is then modulated and demodulated
along  with  the  channel  effects.  The  demodulated  sequence  will 
exhibit  the  characteristic  burst  of  errors  associated  with  the 
fading  channel.  But  after  deinterleaving,  the  original  sequence 
of  bits  is  restored  and  the  bursty  bit  errors  will  be  spread  out 
over  the  whole  interleaver  span  and  appear  more  randomly 
distributed for better decoder performance. To demonstrate
performance in the fading environment, a convolutional
interleaver of size 24 x 384, with an interleaver time
span (Ts) of 1.92 seconds, was used.
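
The burst-spreading behavior described above can be illustrated with a small sketch. It assumes a Forney-style convolutional interleaver built from delay lines; the report does not give the internal structure of its 24 x 384 interleaver, so the class name, function names, and the toy parameters (4 branches, delay step 2) are illustrative only. For reference, 24 x 384 = 9216 symbol bits, which at 4800 bps is consistent with the quoted 1.92 second span.

```python
from collections import deque


class DelayLineBank:
    """Bank of FIFO delay lines visited in round-robin order (illustrative)."""

    def __init__(self, delays, fill=0):
        # One FIFO per branch, pre-filled with 'fill' symbols.
        self.lines = [deque([fill] * d) for d in delays]
        self.branch = 0

    def push(self, symbol):
        line = self.lines[self.branch]
        self.branch = (self.branch + 1) % len(self.lines)
        if not line:              # zero-delay branch passes straight through
            return symbol
        line.append(symbol)
        return line.popleft()


def interleave(symbols, branches, step):
    # Branch j delays its symbols by j*step visits (assumed structure).
    bank = DelayLineBank([j * step for j in range(branches)])
    return [bank.push(s) for s in symbols]


def deinterleave(symbols, branches, step):
    # Complementary delays, so every symbol sees the same total delay.
    bank = DelayLineBank([(branches - 1 - j) * step for j in range(branches)])
    return [bank.push(s) for s in symbols]


if __name__ == "__main__":
    B, M = 4, 2
    data = list(range(1, 201))
    channel = interleave(data, B, M)
    for t in range(60, 64):       # a 4-symbol error burst on the channel
        channel[t] = -1
    received = deinterleave(channel, B, M)
    print([k for k, v in enumerate(received) if v == -1])
```

With these toy parameters the four adjacent channel errors land 7 symbols apart after deinterleaving, which is the kind of spreading that lets the Viterbi decoder treat them as isolated errors. The total delay of the pair is B*(B-1)*M symbols.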

3-4.4  Test  Results. 

3-4.4.1 DRT Results.

The  Diagnostic  Rhyme  Test  (DRT)  was  used  to  measure  the 
intelligibility  of  the  algorithm  using  the  spectral  peak  method 
described  in  Section  3.3.  The  talkers  for  the  test  consisted  of 
three  male  talkers  and  three  female  talkers.  They  varied  in 
age,  nationality,  and  first  languages.  The  listeners  for  this 
test  also  varied  in  age,  sex,  nationality,  and  first  languages. 
After  each  test  was  taken,  the  listeners  were  allowed  to  express 
experiences and observations about the test. These initial
tests were run with the algorithm operating at 96 kbps, 6.5 kbps,
4.6 kbps, and 2.4 kbps. The 96 kbps rate corresponds to the
uncompressed case and serves as a baseline. The effect
of data rate on the intelligibility can be seen in Table 3-3,
where the intelligibility falls steadily with decreasing data
rate. The average adjusted (corrected for guessing) and
unadjusted scores were as follows:

Table  3-3.  Comprehensive  DRT  scores. 

Data Rate (kbps)    Unadjusted (%)    Adjusted (%)

      96                 94               89
       6.5               84               68
       4.6               82               64
       2.4               77               54

The  scores  show  a  dramatic  difference  between  adjusted 
and  unadjusted  since  there  were  only  two  choices  for  each  word 
and  there  was  a  good  chance  of  guessing  the  correct  answer. 
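
The size of the gap between the two columns can be checked with a short sketch. It assumes the usual correction for guessing on a two-alternative test, adjusted = 100 x (right - wrong) / total, which reduces to 2 x unadjusted - 100. The report does not state the formula it used, and the function name is illustrative.

```python
def adjusted_score(unadjusted_pct):
    """Two-alternative correction for guessing: 100*(R - W)/T = 2*U - 100."""
    return 2.0 * unadjusted_pct - 100.0


# (data rate in kbps, unadjusted %, adjusted % as reported in Table 3-3)
table_3_3 = [(96, 94, 89), (6.5, 84, 68), (4.6, 82, 64), (2.4, 77, 54)]

if __name__ == "__main__":
    for rate, unadjusted, reported in table_3_3:
        print(rate, unadjusted, reported, adjusted_score(unadjusted))
```

Under this assumption the computed values agree with the reported adjusted scores to within one point; the one-point difference at 96 kbps is consistent with the unadjusted score having been rounded before tabulation.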

An  interesting  note  is  that  the  uncompressed  speech,  at 
96  kbps,  did  not  produce  a  100%  DRT  score.  This  is  indicative 
of  lapses  in  listeners'  attention,  occasionally  indistinct 
pronunciation,  and  the  effect  of  the  bandlimiting  anti-aliasing 
filter.  In  addition,  it  was  later  discovered  that  the  digital 
voice  did  not  cover  the  entire  dynamic  range  of  the  digitizer. 
This  should  be  noted  when  analyzing  these  DRT  results. 

The female talkers appeared to provide the most
intelligible speech, as seen in Figure 3-7a, at data rates of
2.4 kbps and 6.5 kbps. At 4.6 kbps and 96 kbps, the
difference between male and female talkers is insignificant.

The data rate reduction also took its toll on the
intelligibility: almost all the phonemic features decreased
with decreasing data rate, with nasality surviving the best and
graveness suffering the most (see Figures 3-7b and 3-7c).

The responses from the listeners also showed that their
performance improved as they became more experienced.
This is demonstrated by the results from the experienced
listeners, who had several hours of listening time. The
isolated experienced listener scores were as follows:

Table  3-4.  Experienced  listener  DRT  scores. 


Data Rate (kbps)    Unadjusted (%)    Adjusted (%)

      96.0               98               96
       6.5               93               87
       4.6               92               84
       2.4               91               82

Comparing these scores to those of the LPC DRT of 88 [1], we
can see that the experienced listener using the maximum
spectral peak algorithm can perform as well as LPC.

3-4.4.2 PACT Results.

The purpose of these initial tests including bit errors
was to compare the performance of the SFC and the LPC in similar
environments. Bit errors were simulated by the system model and
a channel model with randomly distributed noise or noise and
fading. The bit error rates applied were 1% or 10%, and the
noise and fading cases had decorrelation times of 1 s (tau0 =
1 s). The word sets were those described in the PACT description,
with three sets of the random alphabet sequence to prevent the
listener from learning a sequence. Both the SFC and the LPC-10
were run through the same errored environment so the results are
directly comparable.
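
For the randomly distributed noise cases only, the error injection can be pictured with the minimal sketch below. It assumes independent bit flips at a fixed probability and does not attempt to reproduce the report's system model or its fading channel; the function names and the seed parameter are illustrative.

```python
import random


def inject_bit_errors(bits, ber, seed=0):
    """Flip each bit independently with probability 'ber' (assumed model)."""
    rng = random.Random(seed)     # fixed seed so both coders see the same errors
    return [b ^ 1 if rng.random() < ber else b for b in bits]


def measured_ber(sent, received):
    """Fraction of positions where the received bit differs from the sent bit."""
    errors = sum(s != r for s, r in zip(sent, received))
    return errors / len(sent)


if __name__ == "__main__":
    sent = [0, 1] * 50000
    for ber in (0.01, 0.10):
        print(ber, measured_ber(sent, inject_bit_errors(sent, ber)))
```

Reusing one seed for both coders mirrors the point made above: the comparison is only direct if both bit streams pass through the same errored environment.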

Figure  3-8  displays  the  results  from  the  initial  PACT. 


The results show that the SFC was just as survivable as
the LPC-10 in all cases, and slightly better in the case of 10%
random noise. In the benign case (no bit errors) both
demonstrate high intelligibility, but the scores are lower than
in the 1% noise case. This is most likely due to the learning
curve associated with the synthetic quality of the speech and
with experience in taking the tests. In 1% noise both SFC and
LPC perform reasonably well, maintaining better than 90%
intelligibility. In 10% noise, the intelligibility of both
coding schemes suffers significantly. The LPC performed more
poorly than the SFC; however, the 70-80% scores indicate that
roughly 1 of 4 words does not survive the bit errors. In the
fading cases both the LPC and the SFC survive the 1% bit error
rate, but both perform poorly at the 10% bit error rate.

For an experienced listener there was only a slight
variation. Figure 3-9 displays the results for the experienced
listeners.

In the 10% noise case the SFC survives the errors with a
nearly 90% word recovery rate, whereas LPC could only produce an
85% word recovery rate. In the fading case the errors still
seem to be very severe, and neither LPC nor SFC can recover from
the bit errors.

3-4.5  Conclusions. 

The  SFC  voice  using  the  simple  initial  algorithm  sounds 
rather  synthetic  with  some  prominent,  annoying  clicks  caused  by 
the frame discontinuities at each boundary. Experienced
listeners, however, are able to filter out these
artifacts. Apparently, the experienced listener was much more
familiar with the quality of the synthetic speech, and was not
distracted by the unusual character of the voice.

[Figure 3-9. Experienced listener PACT results: experienced
listener preliminary PACT scores versus error case (BER) for
0% benign, 1% noise, 10% noise, 1% fading, and 10% fading.]

When bit errors are applied, the listeners perform
about the same with both the LPC and the SFC. With high bit
error rates LPC was prone to dropping out completely, resulting
in losses of whole words, or parts of words, making the phonetic
word spoken unrecognizable. With the SFC, the bit errors
resulted  in  stray  tones  or  inappropriate  magnitudes  of  the 
frequency components chosen. Consequently, a frequency or
magnitude could be in error for a particular frame, but there
still remained the surviving frequency components containing
valid information. Although part of the errored word may have
corrupted frequency information, enough valid frequency
information survives the errors to make the reconstructed word
recognizable. The recognizability relied strongly on the
listener's ability to mentally filter out the stray tones caused
by the errors. The nature of the artifacts from the errors is
such that they may be corrected in most cases. The correction
schemes discussed in the next section are able to recognize
most decoded frequency and magnitude errors, leaving only the
valid frequency information for the reconstruction and resulting in
significantly better performance for SFC.

The quality of the SFC benign speech, and its
performance with bit errors, demonstrated the possibility of developing
a low bit rate, survivable voice coding scheme. From here we
decided  to  take  a  closer  look  at  techniques  for  improving  the 
quality  of  the  speech  with  bit  errors  using  straightforward 
error  correction  schemes  uniquely  suited  to  this  SFC  coding 
approach. 


3-5  ERROR  CORRECTION  SCHEMES. 


The  majority  of  the  stray  tones  in  SFC  due  to  bit 
errors  can  be  eliminated  by  monitoring  the  decoded  magnitude  and 
frequency  data  and  exploiting  the  structure  of  the  transmitted 
data.  By  ignoring  or  correcting  received  frequency  components 
which  are  clearly  incorrect,  stray  tones  at  inappropriate 
frequencies  can  be  minimized. 

Normally in speech, frequency peaks drift smoothly from
frame to frame, so monitoring the peak selection
from frame to frame can help correct errors in frequency
identification. Whenever a decoded frequency bin is beyond the
maximum deviation allowed from frame to frame, the decoded bin is
assumed to be in error and is deleted. Magnitude correction is
obtained by sending the magnitudes for each frame in descending
order; i.e., the first frequency bin sent has the largest
magnitude, the second has the next largest, etc. Should there
be a decoded magnitude at the receiver that does not fall in
proper order, then the magnitude in error is interpolated
between the surrounding magnitudes.
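
The two checks can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the frequency check here compares against the previous frame only, the 4-bin limit is borrowed from step 9b of Section 4-1, and the clamp that keeps an interpolated magnitude from exceeding its predecessor is an added assumption. Both function names are illustrative.

```python
def drop_stray_bins(bins, prev_bins, max_dev=4):
    """Delete decoded bins farther than max_dev from every previous-frame bin."""
    if not prev_bins:             # nothing to compare against (e.g. first frame)
        return list(bins)
    return [b for b in bins
            if any(abs(b - p) <= max_dev for p in prev_bins)]


def smooth_magnitudes(mags):
    """Restore descending order by interpolating out-of-order magnitudes."""
    fixed = list(mags)
    for k in range(1, len(fixed)):
        prev = fixed[k - 1]
        if fixed[k] > prev:       # violates the transmitted descending order
            nxt = fixed[k + 1] if k + 1 < len(fixed) else prev
            # Interpolate between neighbours; clamp so order is guaranteed.
            fixed[k] = min(prev, (prev + nxt) / 2.0)
    return fixed


if __name__ == "__main__":
    print(drop_stray_bins([10, 40, 90], [11, 42, 60]))
    print(smooth_magnitudes([7, 5, 6, 3, 2, 1]))
```

In the example, bin 90 has no neighbour within 4 bins in the previous frame and is dropped, and the out-of-order magnitude 6 is replaced by the midpoint of its neighbours 5 and 3.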

Together, these two error correction schemes
eliminated a very large portion of the stray tones due to
errors. The reconstructed speech in many cases still had enough
valid  frequency  information  so  that  word  recognition  was 
maintained.  The  listener  no  longer  had  to  filter  out  the  stray 
tones  and  was  able  to  concentrate  more  on  the  surviving 
information  in  the  effort  to  recognize  the  word. 

3-6  WAVEFORM  CONTINUOUS  ALGORITHM. 

The  success  of  the  intelligibility  experiments 
described  in  the  previous  section  led  us  to  seek  a  method  of 
reducing  the  most  distracting  element  of  the  SFC  voice 
reconstructions: the discontinuities at frame boundaries. We
felt that eliminating these clicks at the frame rate would not
only improve the overall perceived quality of the
reconstructions, but could also add to the intelligibility
scores.

In  searching  for  methods  to  eliminate  frame 
discontinuities,  several  references  were  encountered  which 
addressed  this  issue  from  the  perspective  of  high  quality, 
higher data rate coding [4,5]. Since the techniques
described  in  these  references  were  similar  to  techniques  we  had 
begun  pursuing,  we  applied  modified  versions  of  the  reported 
methods  to  achieve  waveform  continuity  for  the  low  rate, 
synthetic  quality  coding  technique  being  studied. 


The  basic  idea  underlying  this  approach  is  to 
synthesize  the  voice  using  sine  wave  generation  rather  than  an 
inverse  FFT  on  the  frame,  and  allow  the  frequency,  phase  and 
amplitude  of  each  of  the  6  sine  wave  components  to  vary  smoothly 
across  the  frame  to  match  with  corresponding  components  in  the 
following  frame.  Components  of  the  two  frames  which  aren't 
close  enough  in  frequency  are  not  forced  into  phase  continuity. 
Thus,  components  which  are  part  of  an  extended  vowel  sound  will 
remain  phase  continuous.  The  result  of  this  synthesis  is  a 
reconstructed  waveform  with  smooth  but  reverberant  and  mildly 
gurgly  vowel  sounds. 

The  specific  interpolation  procedure  employed  connects 
the  amplitudes  of  the  sine  wave  with  a  simple  linear 
interpolation  across  the  frame,  and  the  phase  is  modeled  as  a 
cubic  function  of  time.  This  cubic  polynomial  phase 
representation  allows  the  phase  to  change  smoothly  across  the 
frame  as  described  in  [4]. 

A  quadratic  phase  function  was  also  tried  and  very 
similar  qualitative  results  were  obtained,  indicating  that  a 
simpler  quadratic  phase  function  could  be  used  in  a  real-time 
implementation  of  the  technique.  The  quadratic  function  results 
from  allowing  both  the  frequency  and  the  phase  to  change 
linearly  across  the  frame  to  match  the  initial  samples  of  the 
next  frame;  i.e., 

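\[
f(i) = (a_1 + \Delta a \, i) \, \sin\big( (w_1 + \Delta w \, i) \, i + p_1 + \Delta p \, i \big)
\]

where

$a_1$ = amplitude at frame 1,
$w_1$ = frequency at frame 1,
$p_1$ = phase at frame 1,

$\Delta a = (a_2 - a_1)/N$,
$\Delta w = (w_2 - w_1)/N$ (with frame size $N$), and
$\Delta p = (p_2 - p_1)/N$.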
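
As a rough illustration, the expression above can be evaluated directly for one sine wave component. The sketch assumes frequencies are given in radians per sample and evaluates the formula exactly as written; the function name and argument order are illustrative.

```python
import math


def interpolate_component(a1, w1, p1, a2, w2, p2, n):
    """Evaluate f(i) for i = 0..n-1 with linear amplitude, frequency and phase steps."""
    da = (a2 - a1) / n            # amplitude increment per sample
    dw = (w2 - w1) / n            # frequency increment per sample
    dp = (p2 - p1) / n            # phase increment per sample
    return [(a1 + da * i) * math.sin((w1 + dw * i) * i + p1 + dp * i)
            for i in range(n)]


if __name__ == "__main__":
    frame = interpolate_component(1.0, 0.1, math.pi / 2, 0.5, 0.2, 0.0, 256)
    print(len(frame), frame[0])
```

A full frame would be the sum of such components, one per matched spectral peak.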

When  voice  is  synthesized  with  the  interpolation 
described  above  applied  across  the  entire  frame,  the  smooth 
frame transitions are achieved at the expense of a more robotic
voice quality. Furthermore, it is clear from examining plots of
the voice data that frames synthesized without interpolation
more closely match the initial waveforms. Consequently, we
sought  a  method  of  ensuring  frame  continuity,  but  without 
disrupting  the  waveform  over  the  entire  frame.  As  a  compromise, 
we  allowed  amplitude,  frequency,  and  phase  interpolation  over 
the  final  quarter  of  each  frame,  but  preserved  the  fixed 
sinusoids  during  the  first  three  quarters  of  the  frame.  The 
result  is  a  reasonable  trade-off  between  the  extremes  of  frame 
clicks  with  no  interpolation  and  robotic  voice  that  lacks  sharp 
transitions  with  full-frame  interpolation. 

To  determine  the  effect  of  this  modification  to  the 
algorithm  on  word  intelligibility  in  various  noise  and  fading 
conditions,  the  PACT  tests  were  administered  again.  The  results 
of  these  tests  are  summarized  in  Figures  3-10  and  3-11.  In 
these histograms, the benign case involved no bit errors, the
cases labeled "noise" included randomly distributed errors at
the bit error rates indicated, and the cases labeled "fade"
included bit errors obtained from the link model in a fading
channel with a 1.9 sec interleaver at the indicated error rate.
Inspection of these histograms indicates that forcing waveform
continuity improved the intelligibility by several percentage
points.


Moreover,  this  version  of  SFC  (with  the  error 
correction)  outperformed  standard  LPC  in  all  cases  when  bit 
errors  were  present. 

The  PACT  scores  do  not  reflect  the  fact  that  the 
reconstructed  female  voice  is  significantly  less  warbly  sounding 
than  the  male  voice.  This  is  attributed  to  higher  pitch  of  the 
female  talker,  resulting  in  fewer  total  spectral  peaks  in  the 
analyzed  band.  This  in  turn  increases  the  likelihood  that 
corresponding  peak  locations  will  be  tracked  for  the  entire  span 
of a particular type of sound. For the male talker, a peak
might be lower in amplitude for a particular frame as a result
of the pitch pulse placement within a frame, for example. This
temporarily lower peak amplitude can allow some other peak to be
chosen occasionally within a given type of sound, causing a
rough or warbling effect for the male talker.

[Figure 3-10. PACT scores for final SFC - male talker.]

Conclusions  that  can  be  drawn  from  this  data  are 
presented  in  section  4.2. 

3-7  ALGORITHM  COMPLEXITY  AND  IMPLEMENTATION. 

In  evaluating  the  proposed  algorithm,  it  is  worthwhile 
to  consider  the  computational  complexity  to  determine  whether 
the  technique  can  be  implemented  at  reasonable  cost.  The 
computations  involved  are  summarized  in  Table  3-5,  which 
provides  rough  estimates  of  the  computation  time  required  by 
each  block  of  processing  using  a  typical,  readily  available  and 
inexpensive  signal  processing  microprocessor,  the  TMS  32010.  As 
can  be  seen  from  this  table,  all  the  required  processing  could 
be  performed  by  one  or  two  such  devices  within  the  frame  time  of 
32  msec.  To  focus  on  the  signal  processing  requirements,  this 
table  assumes  that  line  protocol  issues  such  as  signalling  bits 
and frame synchronization are handled by a separate single chip
microcontroller.

Table  3-5.  Estimated  SFC  computation  time  for  TMS  32010. 


Analysis:   256 pt FFT (real input)        4.0 msec
            Compute Magnitude Squared      0.2
            Pick Peaks                     0.2
            Compute Magnitude & Phase      0.6
            Quantize Peaks                 0.3
            Pack data                      0.5

Synthesis:  Unpack data                    0.5
            Linearize magnitude            0.3
            Match phase to next frame      0.3
            Generate sines                 2.0

TOTAL                                      8.9 msec

Time available (256 samples)              32.0 msec
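
The arithmetic behind the table can be checked with a few lines. The per-block times are copied from Table 3-5; the variable names are illustrative, and the sample rate is inferred from 256 samples per 32 msec frame rather than stated in the table.

```python
# Per-block estimates in msec, copied from Table 3-5.
analysis_msec = {
    "256 pt FFT (real input)": 4.0,
    "Compute Magnitude Squared": 0.2,
    "Pick Peaks": 0.2,
    "Compute Magnitude & Phase": 0.6,
    "Quantize Peaks": 0.3,
    "Pack data": 0.5,
}
synthesis_msec = {
    "Unpack data": 0.5,
    "Linearize magnitude": 0.3,
    "Match phase to next frame": 0.3,
    "Generate sines": 2.0,
}

frame_msec = 32.0
frame_samples = 256

total_msec = sum(analysis_msec.values()) + sum(synthesis_msec.values())
sample_rate_hz = frame_samples / (frame_msec / 1000.0)   # inferred, not tabulated
load_fraction = total_msec / frame_msec                  # share of one frame time

if __name__ == "__main__":
    print(total_msec, sample_rate_hz, load_fraction)
```

On these estimates the combined analysis and synthesis load is roughly 28% of the frame time, which is consistent with the statement that one or two devices would suffice.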

SECTION 4

CONCLUSIONS  AND  FUTURE  DIRECTIONS 

4-1  ALGORITHM  SUMMARY. 

The  final  algorithm  for  voice  compression  resulting 
from  this  study  can  be  summarized  as  follows: 

1.  Section the input voice into 256-sample blocks. As
each block is digitized, apply an FFT.

2.  Convert the {Re,Im} pairs of transformed data into
magnitude squared (Re*Re + Im*Im).

3.  Find the FFT bin numbers of the local peaks in spectral
magnitude. A peak bin is simply a bin with an
amplitude greater than the amplitudes in both adjacent
bins.

4.  Sort  the  peaks  to  determine  the  bin  numbers  of  the  6 
largest  peaks. 

5.  Determine  phase  of  the  6  peak  bins. 

6.  Match  the  peaks  to  corresponding  peaks  in  the  previous 
frame  by  determining  the  bin  in  the  current  frame 
closest  to  each  bin  of  the  previous  frame.  If 
corresponding  peaks  are  more  than  2  bins  apart  (62  Hz), 
then  disqualify  the  match.  If  2  bins  of  the  previous 
frame  link  to  the  same  bin  of  the  current  frame, 
preserve  the  link  with  the  largest  amplitude. 

7.  Linearly  quantize  the  amplitude  and  phase  to  3  bits  for 
each  of  the  largest  5  peak  bins.  These  are  transmitted 
in  order  of  descending  peak  amplitude  to  permit  error 
correction  at  the  receiver.  The  final  (smallest)  peak 
is transmitted with amplitude quantization of 2 bits
over half the range of the first peak, and with phase
quantization of 2 bits.


8.  Transmit  the  bin  number  (7  bits)  and  quantized 
amplitude  and  phase  for  each  bin. 

9.  At  the  receiver,  apply  data  checking: 

a.  Ensure  that  peaks  are  specified  in  decreasing 
amplitude  order.  Apply  smoothing  to  ensure  this 
order  if  necessary. 

b.  If  any  peak  locations  are  separated  by  more  than  4 
bins  from  peaks  in  either  the  previous  frame  or 
the  next  frame,  do  not  synthesize  them.  This 
reduces  the  energy. 

10.  For  peaks  that  survive  the  data  checking,  synthesize 
192  samples  of  the  voice  using  a  sum  of  constant  phase 
and  amplitude  sine  waves.  This  can  be  done  either 
using  an  inverse  FFT  or  directly  generating  the 
sinusoids  through  table  lookup,  weighting  them 
appropriately  and  adding. 

11. The final 64 samples of the frame are generated by
determining the phase at sample 192, then computing for
each peak frequency:

at = (new amp - old amp)/64

dph = (new phase - old phase)/64

dk = TWOPI*(new bin # - old bin #)/(256*64)

Then, for all remaining 64 samples compute
s(i) = s(i) + (a1+at*im)*sin((w1+dk*im)*i+dph*im+t1)
where i=192..256, and im=i-192, and a1, w1, and t1 are
amplitude, frequency, and phase of the corresponding
frequency component in the previous frame.

All  256  samples  of  the  frame  have  now  been  computed. 
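
The analysis side of the summary (steps 1 through 4) and the per-frame bit count implied by steps 7 and 8 can be sketched as below. This is an illustration under stated assumptions: a plain DFT stands in for the FFT to keep the example self-contained, the bit count assumes 3 bits each for amplitude and phase on the five largest peaks, and all function names are illustrative.

```python
import cmath
import math

NUM_PEAKS = 6


def magnitude_squared_spectrum(block):
    """Steps 1-2: transform one real block and return Re*Re + Im*Im per bin."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):   # a real input needs only bins 0..n/2
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(block))
        spectrum.append(acc.real * acc.real + acc.imag * acc.imag)
    return spectrum


def largest_peaks(mag2, count=NUM_PEAKS):
    """Steps 3-4: bins larger than both neighbours, largest first."""
    peaks = [k for k in range(1, len(mag2) - 1)
             if mag2[k] > mag2[k - 1] and mag2[k] > mag2[k + 1]]
    peaks.sort(key=lambda k: mag2[k], reverse=True)
    return peaks[:count]


def bits_per_frame():
    """Steps 7-8: bin numbers plus quantized amplitude and phase (assumed split)."""
    bin_bits = NUM_PEAKS * 7              # 7-bit bin number for each peak
    large_peak_bits = 5 * (3 + 3)         # 3-bit amplitude + 3-bit phase
    last_peak_bits = 2 + 2                # 2-bit amplitude + 2-bit phase
    return bin_bits + large_peak_bits + last_peak_bits


if __name__ == "__main__":
    block = [math.sin(2 * math.pi * 10 * t / 256)
             + 0.5 * math.sin(2 * math.pi * 40 * t / 256) for t in range(256)]
    print(largest_peaks(magnitude_squared_spectrum(block))[:2])
    print(bits_per_frame(), bits_per_frame() / 0.032)
```

Under these assumptions a frame carries 76 bits, or about 2375 bps at one frame every 32 msec, which fits within a 2400 bps channel.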

4-2  CONCLUSIONS.

Inspection of Figures 3-10 and 3-11 clearly demonstrates
four significant results:

1.  Word intelligibility of SFC with no bit errors is about
100%.

2.  Word intelligibility of SFC in a 10% randomly
distributed bit error rate remains high (95%) while
word intelligibility of LPC-10 drops dramatically
(60-80%).

3.  In  fading  with  a  1.9  sec  interleaver  length,  SFC 
intelligibility  degrades  much  more  slowly  than  LPC. 

4.  As  a  consequence  of  result  2,  intelligibility  loss  from 
fading  with  SFC  can  be  avoided  at  up  to  10%  average  bit 
error  rate  by  using  an  interleaver.  This  strategy  does 
not  work  for  LPC. 

Not  apparent  from  the  PACT  scores  is  the  reverberant  quality  of 
the  voice.  This  aspect  of  the  speech  quality  is  initially 
distracting,  but  within  a  short  listening  period  the  listener 
adapts to the reverberance.

4-3  FUTURE  DIRECTIONS. 

The  positive  results  of  this  development  indicate  that 
several  issues  should  be  studied  further  before  arriving  at  a 
specific  recommendation  for  techniques  to  improve  the 
survivability  of  low  data  rate  voice  communications  systems 
against atmospheric nuclear device effects.

4-3.1  Real-time SFC Implementation.

First,  the  SFC  technique  should  be  implemented  in  a 
real-time  test  set  to  permit  a  more  comprehensive  evaluation  of 
the  voice  quality  acceptability  in  fading  and  non-fading 
environments.  Due  to  the  simplicity  of  the  technique,  this 
implementation  could  be  efficiently  accomplished  using 
off-the-shelf  digital  signal  processor  cards  compatible  with  an 
IBM/AT  computer.  Further  improvements  to  the  coding  technique 
might  also  be  considered  at  this  time.  For  example: 

the  encoder  should  consider  the  peak  locations  of  both 
the  next  frame  and  the  previous  frame  to  choose  the 
peak set for any specific frame;

more  thought  might  be  given  to  coding  the  peak 
locations  efficiently,  then  adding  redundant  coding  to 
protect  against  transmission  errors; 

RELP  (residual-excited  linear  prediction)  vocoder 
techniques  could  be  applied  to  the  received  data  which 
might  permit  smooth  frame  transitions  without  the 
robotic  quality  of  full  frame  interpolation; 

the  possibility  of  coding  z-plane  pole  locations  rather 
than  FFT  bin  number  and  phase  should  be  examined. 

4-3.2  Redundantly  Coded  LPC-VQ. 

In  addition,  other  possible  techniques  should  be 
surveyed.  In  particular,  the  technique  of  using  vector 
quantization  in  conjunction  with  conventional  LPC  vocoders 
should  be  considered.  Vector  quantization  is  a  technique  for 
reducing  the  number  of  bits  required  to  represent  a  frame  of  LPC 
voice,  and  has  typically  been  applied  to  the  problem  of 
transmitting  voice  at  data  rates  below  1000  bits/second.  By 
reducing the basic data rate, then adding error protection to
bring the data rate back to 2400 baud, it may be that LPC can be
made more effective in a fading environment.

4-3.3  Real-time  Error  Simulator. 

Finally,  some  mechanism  is  needed  to  tie  PACT 
intelligibility  scores  or  DRT  scores  to  the  actual  impact  on 
typical  operations  for  currently  deployed  AN-DVT  units.  This 
could  be  efficiently  accomplished  with  an  inexpensive  unit  which 
corrupts  communications  with  bit  errors  corresponding  to  typical 
atmospheric  nuclear  effects  for  specific  link  models  during  an 
operational  exercise.  Such  a  unit  could  be  assembled  from  a 
standard  personal  computer  with  a  data  base  of  bit  error 
patterns  for  various  combinations  of  decorrelation  times  and 
interleaver  buffer  sizes.  An  operator  could  specify  a 
particular  scenario,  then  the  appropriate  bit  errors  would  be 
injected  into  the  data  stream.  Implementing  an  error  pattern  at 
the  modulation  baseband  is  dramatically  more  cost  effective 
than  performing  such  a  test  with  an  RF  atmospheric  effects 
simulator,  permitting  several  such  units  to  be  used  during  the 
tests  or  as  training  devices  to  prepare  personnel  and  adjust 
communications  protocols  to  cope  with  channels  degraded  by 
atmospheric  effects. 